Internet as Corpus-Automatic Construction of a Swedish News Corpus

نویسنده

  • Martin Duneld
چکیده

This paper describes the automatic building of a corpus of short Swedish news texts from the Internet, its application and possible future use. The corpus is aimed at research on Information Retrieval, Information Extraction, Named Entity Recognition and Multi Text Summarization. The corpus has been constructed by using an Internet agent, the so called newsAgent, downloading Swedish news text from various sources. A small part of this corpus has then been manually tagged with keywords and named entities. The newsAgent is also used as a workbench for processing the abundant flows of news texts for various users in a customized format in the application Nyhetsguiden.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Internet as Corpus Automatic Construction of a Swedish News Corpus

This paper describes the automatic building of a corpus of short Swedish news texts from the Internet, its application and possible future use. The corpus is aimed at research on Information Retrieval, Information Extraction, Named Entity Recognition and Multi Text Summarization. The corpus has been constructed by using an Internet agent, the so called newsAgent, downloading Swedish news text f...

متن کامل

Development of a Swedish Corpus for Evaluating Summarizers and other IR-tools

We are presenting the construction of a Swedish corpus aimed at research on Information Retrieval, Information Extraction, Named Entity Recognition and Multi Text Summarization, we will also present the results on evaluating our Swedish text summarizer SweSum with this corpus. The corpus has been constructed by using Internet agents downloading Swedish newspaper text from various sources. A sma...

متن کامل

Prediction of intonation patterns of accented words in a corpus of read Swedish news

This paper describes an initial attempt at the construction of a data-driven model of Swedish intonation. The study is mainly concerned with model building and prediction of the intonation patterns of accented words in a corpus of read news in Swedish. Extraction of pitch information is achieved by performing a stylization of the pitch contours. The information is used to build a model for the ...

متن کامل

Prediction of intonation patterns of accented words in a corpus of read Swedish news through pitch contour stylization

This paper describes an initial attempt at the construction of a data-driven model of Swedish intonation. The study is mainly concerned with model-building and prediction of the intonation patterns of accented words in a corpus of read news in Swedish. Extraction of pitch information is achieved by performing a stylization of the pitch contours. The information is used to build a model for the ...

متن کامل

A Comparative Analysis of Institutional Identities in a Corpus of English and Persian News Interviews

Institutional identity as a concept in CDA is a field of study that deals with the identities that individuals in institutions obtain, one that merits deep research attention. News interviews as institutional instances can be analyzed based on the impersonal structures because interviewees see themselves as part of the institution and they may not take responsibility when they encounter problem...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2001